Abstract

Human communication typically has an underlying structure. This is
reflected in the fact that in many user-generated videos, a starting
point, an ending, and certain objective steps between the two can be
identified. In this paper, we propose a method for parsing a video into
such semantic steps in an unsupervised way. The proposed method is
capable of providing a “semantic storyline” of the video composed of
its objective steps. We accomplish this using both visual and language
cues in a joint generative model. The proposed method can also provide
a textual description for each of the identified semantic steps. We
evaluate this method on a large number of complex YouTube videos and
show results of unprecedented quality for this intricate and impactful
problem.

1. Introduction 

Human communication takes many forms, including language and vision.
For instance, explaining how to perform a certain task can be
communicated via language (e.g., Do-It-Yourself books) as well as
visual (e.g., instructional YouTube videos) information. Regardless of
the form, such human-generated communication is generally structured
and has a clear beginning, end, and a set of steps in between. Parsing
such communication into its semantic steps is the key to understanding
structured human activities.

Language and vision provide different, but correlated and
complementary, information. The challenge lies in that both video
frames and language (from subtitles generated via ASR) are only noisy,
partial observations of the actions being performed. However, the
complementary nature of language and vision gives the opportunity to
understand the activities from these partial observations. In this
paper, we present a unified model, incorporating both modalities, in
order to parse human activities into activity steps with no form of
supervision other than requiring videos to be of the same category
(e.g., videos retrieved by the queries cooking eggs, changing tires,
etc.).



Figure 1: Given a large video collection (frames and subtitles)
of a structured category (e.g., How to cook an omelette?), we
discover activity steps (e.g., crack the eggs). We also parse the
videos based on the discovered steps.


The key idea in our approach is the observation that a large
collection of videos pertaining to the same activity class typically
includes only a few objective activity steps, and the variability is
the result of the exponentially many ways of generating videos from
activity steps through subset selection and time ordering. We study
this construction based on the large-scale information available on
YouTube in the form of instructional videos (e.g., “Making pancake”,
“How to tie a bow tie”). We adopt instructional videos since they have
many desirable properties, such as the volume of the information (e.g.,
YouTube has 281,000 videos for “How to tie a bow tie”) and a
well-defined notion of activity step. However, the proposed parsing
method is applicable to any type of structured videos as long as they
are composed of a set of objective steps.

The output of our method can be seen as the “semantic storyline” of a
rather long and complex video collection (see Fig. 1). This storyline
provides what particular steps are taking place in the video
collection, when they are occurring, and what their meaning is
(what-when-how). This method also puts videos performing the same
overall task on common ground and captures their high-level relations.

In the proposed approach, given a collection of videos, we first
generate a set of language and visual atoms. These atoms are the result
of relating object proposals from each frame as well as detecting the
frequent words from subtitles. We then employ a generative beta process
mixture model, which identifies the activity steps shared among the
videos of the same category based on a representation using the learned
atoms. The discovered steps are found to be highly correlated with
semantic steps, since the semantics are the strongest common structure
among all of the videos of one category. In our method, we use neither
spatial or temporal labels on actions/steps nor labels on object
categories. We later learn a Markov language model to provide a textual
description for each of the activity steps based on the language atoms
it frequently uses.

2. Related Work 

Three aspects differentiate this work from the majority of existing
techniques: 1) discovering semantic steps from a video category,
2) being unsupervised, 3) adopting a multimodal joint vision-language
model for video parsing. A thorough review of the related literature is
provided below.
Video Summarization: Summarizing an input video as a 
sequence of key frames (static) or video clips (dynamic) is 
useful for both multimedia search interfaces and retrieval 
purposes. Early works in the area are summarized in [5 ( ] 
and mostly focus on choosing keyframes for visualization. 

Summarizing videos is particularly important for long 
sequences like ego-centric videos and news reports [35, 38, 
50]; however, these methods mostly rely on characteristics 
of the application and do not generalize. 

Summarization is also applied to large image collections by recovering
the temporal ordering and visual similarity of images [26], and by
Gupta et al. [17] to videos in a supervised framework using annotations
of actions. These collections are also used to choose important scenes
for key-frame selection [24], which is further extended to video clip
selection [25, 48]. Unlike all of these methods, which focus on forming
a set of key frames/clips for a compact summary (which is not
necessarily semantically meaningful), we provide a fresh approach to
video summarization by performing it through semantic parsing on vision
and language. Regardless of this dissimilarity, we experimentally
compare our method against them.

Modeling Visual and Language Information: Learning the relationship
between visual and language data is a crucial problem due to its
immense applications. Early methods [4] in this area focus on learning
a common multimodal space in order to jointly represent language and
vision. They are further extended to learning higher level relations
between object segments and words [5 ]. Similarly, Zitnick et al.
[63, 62] used abstracted clip-arts to understand spatial relations of
objects and their language correspondences. Kong et al. [2< ] and
Fidler et al. [1 ] both accomplished the task of learning spatial
reasoning using image captions. Relations extracted from image-caption
pairs are further used to help semantic parsing [61] and activity
recognition [41]. Recent works also focus on automatic generation of
image captions, with underlying ideas ranging from finding similar
images and transferring their captions [45] to learning language models
conditioned on the image features [27, 55, 12]; their approach to
learning language models is typically based on either graphical models
[1 ] or neural networks [55, 27, 23].

All aforementioned methods use supervised labels either 
as strong image-word pairs or weak image-caption pairs, 
while our method is fully unsupervised. 

Activity/Event Recognition: The literature on activity recognition is
broad. The closest techniques to ours are either supervised or focus on
detecting a particular (and often short) action in a
weakly/unsupervised manner. Also, a large body of action recognition
methods are intended for trimmed video clips or remain limited to
detecting very short actions [30, 56, 42, 33, 11, 5 ]. Even though some
recent works attempted action recognition in untrimmed videos
[21, 44, 20], they are mostly fully supervised.

Additionally, several methods for localizing instances of actions in
rather longer video sequences have been developed [10, 18, 34, 6, 47].
Our work is different from those in terms of being multimodal,
unsupervised, applicable to a video collection, and not limited to
identifying predefined actions or the ones with short temporal spans.
Also, the previous works on finding action primitives, such as
[42, 60, 19, 32, 31], are primarily limited to discovering atomic
sub-actions and therefore fail to identify complex and high-level parts
of a long video.

Recently, event recounting has attracted much interest 
and intends to identify the evidential segments for which a 
video belongs to a certain class [57, 9, 3]. Event recounting 
is a relatively new topic and the existing methods mostly 
employ a supervised approach. Also, their end goal is to 
identify what parts of a video are highly related to an event, 
and not parsing the video into semantic steps. 

Recipe Understanding: Following the interest in community-generated
recipes on the web, there have been many attempts to automatically
process recipes. Recent methods in natural language processing [40, 58]
focus on semantic parsing of language recipes in order to extract
actions and objects in the form of predicates. Tenorth et al. [5< ]
further process the predicates in order to form a complete logic plan.
The aforementioned approaches focus only on the language modality and
are not applicable to videos. The recent advances [5, ] in robotics use
the parsed recipe in order to perform cooking tasks. They use
supervised object detectors and report a successful autonomous
experiment. In addition to the language-based approaches, Malmaud et
al. [39] consider both language and vision modalities and propose a
method to align an input video to a recipe. However, this method cannot
extract the steps automatically and requires a ground truth recipe to
align. On the contrary, our method uses both visual and language
modalities and extracts the actions while autonomously discovering the
steps. Also, [15] generates multi-modal recipes from expert
demonstrations. However, it is developed only for the domain of
“teaching user interfaces” and is not applicable to videos.

3. Overview 

Given a large video collection, our algorithm starts with learning a
set of visual and language atoms which are further used for
representing multimodal information (Section 4). These atoms are
designed to be more likely to correspond to mid-level semantic concepts
like actions and objects. In order to learn visual atoms, we generate
object proposals and cluster them into mid-level atoms, whereas for the
language atoms we simply use the salient and frequent words in the
subtitles. After learning the atoms, we represent the multi-modal
information in each frame based on the occurrence statistics of the
atoms (Section 4). Given the sequence of multi-modal frame
representations, we discover a set of clusters occurring over multiple
videos using a non-parametric Bayesian method (Section 5.1). We expect
these clusters to correspond to the activity steps which construct the
high-level activities. Our empirical results confirm this, as the
resulting clusters significantly correlate with the activity steps.

4. Forming the Multi-Modal Representation 

Finding the set of activity steps over a large collection of videos
with large visual variety requires us to represent the semantic
information in addition to the low-level visual cues. Hence, we find
our language and visual atoms by using mid-level cues like object
proposals and frequent words.



Figure 2: We learn language and visual atoms to represent multimodal
information. Language atoms are frequent words and visual atoms are the
clusters of object proposals.


In order to group these object proposals into mid-level visual atoms,
we follow a clustering approach. Although any graph clustering approach
(e.g., Key segments [36]) can be applied for this, the joint processing
of a large video collection requires handling large visual variability
among multiple videos. We propose a new method to jointly cluster
object proposals over multiple videos in Section 5. Each cluster of
object proposals corresponds to a visual atom.
Learning Language Atoms: We define the language atoms as the salient
words which occur more often than their ordinary rates based on the
tf-idf measure. The document is defined as the concatenation of all
subtitles of all frames of all videos in the collection. Then, we
follow the classical tf-idf measure and use it as

$$\mathrm{tfidf}(w, D) = f_{w,D} \times \log\left(1 + \frac{N}{n_w}\right)$$

where $w$ is the word we are computing the tf-idf score for, $f_{w,D}$
is the frequency of the word in the document $D$, $N$ is the total
number of video collections we are processing, and $n_w$ is the number
of video collections whose subtitles include the word $w$.
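The scoring and top-K selection of language atoms can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the pre-tokenized input format, and the use of raw counts for $f_{w,D}$ are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_scores(collection_tokens, all_collections_tokens):
    """Score each word of one collection as f_{w,D} * log(1 + N / n_w).

    collection_tokens: list of words from the concatenated subtitles D
        (assumed to be one of the entries of all_collections_tokens).
    all_collections_tokens: one token list per video collection.
    """
    n_collections = len(all_collections_tokens)
    # n_w: number of collections whose subtitles contain the word w.
    containing = Counter()
    for tokens in all_collections_tokens:
        containing.update(set(tokens))
    # f_{w,D}: raw count of w in D (an assumption; a normalized
    # frequency would only rescale the scores within one collection).
    freq = Counter(collection_tokens)
    return {
        w: f * math.log(1.0 + n_collections / max(1, containing[w]))
        for w, f in freq.items()
    }


def language_atoms(collection_tokens, all_collections_tokens, top_k=100):
    """Return the top_k words by tf-idf score (ties broken alphabetically)."""
    scores = tfidf_scores(collection_tokens, all_collections_tokens)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

With `top_k=100` this matches the number of atoms used in the experiments; words that appear in every collection receive the smallest multiplier, $\log 2$.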

We sort words by their tf-idf values and choose the top $K$ words as
language atoms ($K = 100$ in our experiments). As an example, we show
the language atoms learned for the category making scrambled egg in
Figure 2.
Representing Frames with Atoms: After learning the visual and language
atoms, we represent each frame via the occurrence of atoms (binary
histogram). Formally, the representation of the $t$-th frame of the
$i$-th video is denoted as $y_t^{(i)}$ and computed as
$y_t^{(i)} = [y_t^{(i),l}, y_t^{(i),v}]$ such that the $k$-th entry of
$y_t^{(i),l}$ is 1 if the subtitle of the frame has the $k$-th language
atom and 0 otherwise. $y_t^{(i),v}$ is also a binary vector, similarly
defined over visual atoms. We visualize the representation of a sample
frame in Figure 3.
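The binary occurrence representation can be illustrated with a short sketch. The function name and the input format (a list of subtitle words plus the indices of the visual atoms that the frame's proposals fall into) are hypothetical choices for this example.

```python
def frame_representation(subtitle_words, frame_atom_ids,
                         language_atoms, num_visual_atoms):
    """Return the binary vector [y^l, y^v] for a single frame.

    subtitle_words: words of the subtitle attached to the frame.
    frame_atom_ids: indices of visual atoms (clusters) that contain at
        least one object proposal of this frame.
    language_atoms: ordered list of language atoms (words).
    num_visual_atoms: total number of visual atoms.
    """
    words = set(subtitle_words)
    # k-th language entry is 1 if the subtitle has the k-th language atom.
    y_lang = [1 if atom in words else 0 for atom in language_atoms]
    present = set(frame_atom_ids)
    # k-th visual entry is 1 if the frame has a proposal in the k-th atom.
    y_vis = [1 if k in present else 0 for k in range(num_visual_atoms)]
    return y_lang + y_vis
```

The two halves are simply concatenated, so each frame becomes one binary vector whose length is the number of language atoms plus the number of visual atoms.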



Learning Visual Atoms: In order to learn visual atoms, we create a
large collection of proposals by independently generating object
proposals from each frame of each video. These proposals are generated
using the Constrained Parametric Min-Cut (CPMC) [8] algorithm based on
both appearance and motion cues. We denote the $k$-th proposal of the
$t$-th frame of the $i$-th video as $r_t^{(i),k}$. Moreover, we drop
the video index $(i)$ if it is clearly implied by the context.


Figure 3: Representation for a sample frame. Three of the object
proposals of the sample frame are in the visual atoms and three of the
words are in the language atoms.

5. Joint Proposal Clustering over Videos 

Given a set of object proposals generated from multiple videos, simply
combining them into a single collection and clustering them into atoms
is not desirable for two reasons: (1) semantic concepts have large
visual differences among different videos and accurately clustering
them into a single atom is hard, (2) atoms should contain object
proposals from multiple videos in order to semantically relate the
videos. In order to satisfy these requirements, we propose a joint
extension to spectral clustering. Note that the purpose of this
clustering is generating atoms, where each cluster represents an atom.



[Figure 4 panels: Proposal Graph for Video 1, Proposal Graph for Video 2, Proposal Graph for Video 3]


Figure 4: Joint proposal clustering. Each object proposal is linked to
its two NNs from the video it belongs to and two NNs from the videos
its video neighbours. Dashed and solid lines denote the intra-video and
inter-video edges, respectively. Black nodes are the proposals selected
as part of the cluster and the gray ones are not selected. Similarly,
the black and gray edges denote selected and not-selected edges,
respectively.


Basic Graph Clustering: Consider the set of object proposals extracted
from a single video, $\{r_t^k\}$, and a pairwise similarity metric
$d(\cdot, \cdot)$ for them. The single cluster graph partitioning
(SCGP) [43] approach finds the dominant cluster which maximizes the
intra-cluster similarity:

$$\arg\max_{x_t^k} \frac{\sum_{(k_1,t_1),(k_2,t_2)} x_{t_1}^{k_1}\, x_{t_2}^{k_2}\, d(r_{t_1}^{k_1}, r_{t_2}^{k_2})}{\sum_{(k,t)} x_t^k} \qquad (1)$$

where $x_t^k$ is a binary variable which is 1 if $r_t^k$ is included in
the cluster, $T$ is the number of frames and $K$ is the number of
proposals per frame. Adopting the vector form of the indicator
variables as $x_{tK+k} = x_t^k$ and the pairwise distance matrix as
$A_{t_1K+k_1,\, t_2K+k_2} = d(r_{t_1}^{k_1}, r_{t_2}^{k_2})$, equation
(1) can be compactly written as $\arg\max_x \frac{x^T A x}{x^T x}$.
This can be solved by finding the dominant eigenvector of $A$ after
relaxing $x_t^k$ to $[0, 1]$ [43, 46]. Upon finding the cluster, the
members of the selected cluster are removed from the collection and the
same algorithm is applied to find the remaining clusters.
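A minimal sketch of the relaxed single-cluster step and the peeling loop is given below. It assumes a symmetric, non-negative similarity matrix stored as nested lists, uses plain power iteration to approximate the dominant eigenvector, and rounds the relaxed indicator with a threshold of 0.5 relative to the largest entry. The threshold, iteration count, and function names are assumptions for illustration and are not taken from [43, 46].

```python
def dominant_cluster(similarity, iterations=200, threshold=0.5):
    """Approximate argmax_x (x^T A x) / (x^T x) with x relaxed to [0, 1].

    Power iteration converges to the dominant eigenvector of A; entries
    are rescaled so the largest is 1 and then thresholded to obtain the
    binary membership (the rounding rule is an assumption).
    """
    n = len(similarity)
    x = [1.0 / n] * n
    for _ in range(iterations):
        y = [sum(similarity[i][j] * x[j] for j in range(n)) for i in range(n)]
        scale = max(y)
        if scale == 0:
            break  # no similarity mass left; keep the current vector
        x = [v / scale for v in y]
    return [i for i, v in enumerate(x) if v >= threshold]


def peel_clusters(similarity, num_clusters):
    """Repeatedly extract the dominant cluster and remove its members."""
    remaining = list(range(len(similarity)))
    clusters = []
    while remaining and len(clusters) < num_clusters:
        sub = [[similarity[i][j] for j in remaining] for i in remaining]
        members = [remaining[i] for i in dominant_cluster(sub)]
        if not members:
            break
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters
```

For example, on a five-proposal matrix where proposals 0 to 2 are mutually similar and proposals 3 and 4 are only weakly related to them, `peel_clusters` returns the tight group first and the leftover pair second.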

Joint Clustering: Our extension of SCGP to multiple videos is based on
the assumption that the key objects occur in most of the videos. Hence,
we re-formulate the problem by enforcing the homogeneity of the cluster
over all videos.

We first create a kNN graph of the videos based on the distance between
their textual descriptions. We use the $\chi^2$ distance of the
bag-of-words computed from the video description. We also create the
kNN graph of object proposals in each video based on the pretrained
“fc7” features of AlexNet [29]. This hierarchical graph structure is
visualized in Figure 4 for three sample videos. After creating this
graph, we impose both “inter-video” and “intra-video” similarity among
the object proposals of each cluster. The main rationale behind this
construction is having a separate notion of distance for inter-video
and intra-video relations, since the visual similarity decreases
drastically for inter-video ones.

Given the intra-video distance matrices $A^{(i)}$, the binary indicator
vectors $x^{(i)}$, and the inter-video distance matrices, we define our
optimization problem as:

where $\mathcal{N}(i)$ is the set of neighbours of video $i$ in the kNN
graph, $\mathbf{1}$ is the vector of ones and $N$ is the number of
videos.

Although we cannot use the efficient eigen-decomposition approach from
[43, 46] as a result of the modification, we can use Stochastic
Gradient Descent (SGD), as the cost function is quasi-convex when
relaxed. We use SGD with the following analytic gradient function:

We iteratively use the method to find clusters, and stop after 20
clusters are found, as the remaining object proposals were deemed not
relevant to the activity. Each cluster corresponds to a visual atom for
our application.

In Figure 5, we visualize some of the atoms (i.e., clusters) we learned
for the query How to Hard Boil an Egg?. As apparent in the figure, the
resulting atoms are highly correlated and correspond to semantic
objects and concepts regardless of their significant intra-class
variability.



Figure 5: Randomly selected images of four randomly selected 
clusters learned for How to hard boil an egg? 


5.1. Unsupervised Parsing 

In this section, we explain the model which we use to discover the
activity steps from a video collection given the language and visual
atoms. We denote the extracted representation of frame $t$ of video $i$
as $y_t^{(i)}$. We model our algorithm based on activity steps and
denote the activity label of the $t$-th frame of the $i$-th video as
$z_t^{(i)}$. We do not fix the number of activities and use a
non-parametric approach.

In our model, each activity step is represented over the atoms as the
likelihood of including them. In other words, each activity step is a
Bernoulli distribution over the visual and language atoms,
$\theta_k = [\theta_k^l, \theta_k^v]$, such that the $m$-th entry of
$\theta_k^l$ is the likelihood of observing the $m$-th language atom in
a frame of activity $k$. Similarly, the $m$-th entry of $\theta_k^v$
represents the likelihood of seeing the $m$-th visual atom. Each
frame's representation $y_t^{(i)}$ is therefore sampled from the
distribution corresponding to its activity as
$y_t^{(i)} \mid z_t^{(i)} = k \sim \mathrm{Ber}(\theta_k)$. As a prior
over $\theta_k$, we use its conjugate distribution, the Beta
distribution.
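To make the emission model concrete, the sketch below evaluates the log-likelihood of one binary frame vector under a step's Bernoulli parameters, assuming the atoms are independent given the step. The function names and the clipping constant are assumptions of this example; the argmax helper only illustrates how the likelihood ranks steps and is not the inference procedure, which relies on sampling.

```python
import math


def bernoulli_log_likelihood(y, theta, eps=1e-9):
    """log p(y | theta) for a binary vector y with independent entries.

    theta[m] is the probability that atom m is observed in a frame of
    the step; values are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y_m, theta_m in zip(y, theta):
        p = min(max(theta_m, eps), 1.0 - eps)
        total += math.log(p) if y_m else math.log(1.0 - p)
    return total


def most_likely_step(y, thetas):
    """Index k of the step whose parameters give y the highest likelihood."""
    return max(range(len(thetas)),
               key=lambda k: bernoulli_log_likelihood(y, thetas[k]))
```

A frame showing atoms 0 and 2 is, for instance, far better explained by a step with parameters `[0.9, 0.1, 0.8]` than by one with `[0.1, 0.9, 0.2]`.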

Given the model above, we explain the generative model 
which links activity steps and frames in Section 5.1.1. 

5.1.1 Beta Process Hidden Markov Model 

For the understanding of the time-series information, Fox et 
al. [14] proposed the Beta Process Hidden Markov Models 
(BP-HMM). In BP-HMM setting, each time-series exhibits 
a subset of available features. Similarly, in our setup each 
video exhibits a subset of activity steps. 

Our model follows the construction of Fox et al. [14] and differs in
the choice of probability distributions, since [14] considers Gaussian
observations while we adopt binary observations of atoms. In our model,
each video $i$ chooses a set of activity steps through an activity step
vector $f^{(i)}$ such that $f_k^{(i)}$ is 1 if the $i$-th video has the
activity step $k$, and 0 otherwise. When the activity step vectors of
all videos are concatenated, they form an activity step matrix $F$ such
that the $i$-th row of $F$ is the activity step vector $f^{(i)}$.
Moreover, each activity step $k$ also has a prior probability $b_k$ and
a distribution parameter $\theta_k$, which parameterizes the Bernoulli
distribution explained in Section 5.1.

In this setting, the activity step parameters $\theta_k$ and $b_k$
follow the beta process as

$$B \mid B_0, \gamma, \beta \sim \mathrm{BP}(\beta, \gamma B_0), \qquad B = \sum_{k=1}^{\infty} b_k \delta_{\theta_k} \qquad (4)$$

where $B_0$ and the $b_k$ are determined by the underlying Poisson
process [16] and the feature vector is determined as independent
Bernoulli draws as $f_k^{(i)} \sim \mathrm{Ber}(b_k)$. After
marginalizing over the $b_k$ and $\theta_k$, this distribution is shown
to be equivalent to the Indian Buffet Process (IBP) [16]. In the IBP
analogy, each video is a customer and each activity step is a dish in
the buffet. The first customer (video) chooses Poisson($\gamma$) unique
dishes (activity steps). Each following customer (video) $i$ chooses a
previously sampled dish (activity step) $k$ with probability
$\frac{m_k}{i}$, proportional to the number of customers ($m_k$) who
have chosen dish $k$, and also chooses Poisson($\frac{\gamma}{i}$) new
dishes (activity steps). Here, $\gamma$ controls the number of selected
activities in each video and $\beta$ promotes the activities getting
shared by videos.
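The buffet analogy can be simulated directly. The sketch below is an illustration under stated assumptions rather than the model's sampler: it uses the one-parameter form of the process, in which video $i$ reuses step $k$ with probability $m_k / i$ and adds Poisson($\gamma / i$) new steps, and it does not model the second concentration parameter. The function names and the fixed seed are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's method (fine for small rates)."""
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_ibp(num_videos, gamma, seed=0):
    """Sample a binary activity step matrix F (videos x steps)."""
    rng = random.Random(seed)
    counts = []  # m_k: number of earlier videos that selected step k
    rows = []
    for i in range(1, num_videos + 1):
        # Reuse each existing step with probability m_k / i.
        row = [1 if rng.random() < m / i else 0 for m in counts]
        # Introduce Poisson(gamma / i) steps not seen before.
        row.extend([1] * sample_poisson(gamma / i, rng))
        for k, f in enumerate(row):
            if k < len(counts):
                counts[k] += f
            else:
                counts.append(f)
        rows.append(row)
    width = len(counts)
    # Pad earlier rows with zeros for steps introduced later.
    return [row + [0] * (width - len(row)) for row in rows]
```

Each row of the returned matrix plays the role of one video's activity step vector; popular steps accumulate larger counts and are therefore more likely to be reused by later videos.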

The above IBP construction represents the activity step discovery part
of our method. In addition, we need to model the video parsing over the
discovered steps; these two need to be modeled jointly. We model each
video as a Hidden Markov Model (HMM) over the selected activity steps.
Each frame has a hidden state, the activity step $z_t^{(i)}$, and we
observe the multi-modal frame representation $y_t^{(i)}$. Since we
model each activity step as a Bernoulli distribution, the emission
probabilities follow the Bernoulli distribution as
$p(y_t^{(i)} \mid z_t^{(i)}) = \mathrm{Ber}(\theta_{z_t^{(i)}})$.

For the transition probabilities of the HMM, we do not put any
constraint and simply model them as any point from a probability
simplex, which can be sampled by drawing a set of Gamma random
variables and normalizing them [14]. For each video $i$, a Gamma random
variable is sampled for the transition between activity step $j$ and
activity step $k$ if both of the activity steps are included in the
video (i.e., if $f_j^{(i)}$ and $f_k^{(i)}$ are both 1). After sampling
these random variables, we normalize them so that the transition
probabilities sum to 1. This procedure can be represented formally as

where $\kappa$ is the persistence parameter promoting self state
transitions (i.e., more coherent temporal boundaries), $\circ$ is the
element-wise product, and $\pi_j^{(i)}$ is the vector of transition
probabilities in video $i$ from activity step $j$ to other steps. This
model is also presented as a graphical model in Figure 6.
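The sample-then-normalize procedure can be sketched as follows, assuming the form used in the BP-HMM construction of [14]: an unnormalized weight is drawn as a Gamma variable with unit scale whose shape is a base value, plus $\kappa$ on the self-transition, and weights of steps that the video does not contain are masked out before normalizing. The symbol `alpha` for the base shape, the default values, and the function name are assumptions of this example.

```python
import random


def sample_transitions(f, alpha=1.0, kappa=5.0, seed=0):
    """Sample per-video transition probabilities restricted to active steps.

    f: binary activity step vector of one video.
    alpha: base Gamma shape (hypothetical name and value).
    kappa: extra shape added to self-transitions (persistence).
    Returns a matrix pi where pi[j][k] is the probability of moving from
    step j to step k; rows of inactive steps are all zeros.
    """
    rng = random.Random(seed)
    active = [k for k, on in enumerate(f) if on]
    pi = [[0.0] * len(f) for _ in f]
    for j in active:
        # Gamma(shape, scale=1) weight for every active target step.
        eta = {
            k: rng.gammavariate(alpha + (kappa if j == k else 0.0), 1.0)
            for k in active
        }
        total = sum(eta.values())
        for k in active:
            pi[j][k] = eta[k] / total
    return pi
```

Increasing `kappa` concentrates each row on its diagonal entry, which corresponds to longer uninterrupted runs of the same activity step.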


Figure 6: Graphical model for BP-HMM: The left plate represents the
activity steps and the right plate represents the videos (i.e., the
left plate is for the activity step discovery and the right plate is
for parsing). See Section 5.1.1 for details.


5.1.2 Gibbs sampling for BP-HMM 

We employ a Markov Chain Monte Carlo (MCMC) method for learning and
inference of the BP-HMM. We base our algorithms on the MCMC procedure
proposed by Fox et al. [14]. Our sampling procedure is composed of two
samplers: (1) an activity step ($f^{(i)}$) sampler from the current
activity step distributions $\theta_k$ and multi-modal frame
representations $y_t^{(i)}$, and (2) an HMM parameter sampler from the
selected activities $f^{(i)}$. Intuitively, we iterate over discovering
activity steps given the temporal activity labels and estimating
activity labels given the discovered activities. We give the details of
this sampler in [1].

6. Experiments 

In order to evaluate the proposed method, we first collected a dataset
(details in Section 6.1). We labelled a small part of the dataset with
frame-wise activity step labels and used the resulting set as a test
corpus. Neither the set of labels nor the temporal boundaries are
exposed to our algorithm, since the setup is completely unsupervised.
We evaluate our algorithm against several unsupervised clustering
baselines and the applicable state-of-the-art algorithms from the video
summarization literature.

6.1. Dataset 

We use WikiHow [2] in order to obtain the top 100 queries that internet
users are interested in and choose the ones which are directly related
to the physical world. The resulting queries are:

How to Bake Boneless Skinless Chicken, Tie a Tie, Clean a Coffee
Maker, Make Jello Shots, Cook Steak, Bake Chicken Breast, Hard Boil an
Egg, Make Yogurt, Make a Milkshake, Make Beef Jerky, Make Scrambled
Eggs, Broil Steak, Cook an Omelet, Make Ice Cream, Make Pancakes,
Remove Gum from Clothes, Unclog a Bathtub Drain

For each of the queries, we crawled YouTube and got the 
top 100 videos. We also downloaded the English subtitles 
if they exist. For the test set, we randomly choose 5 videos 
out of 100 per query. 

6.1.1 Outlier Detection 

Since we do not have any expert intervention in our data collection,
the resulting collection might have outliers, mainly due to the fact
that our queries are typical daily activities and there are many
cartoons, funny videos, and music videos about them. Hence, we have an
automatic coarse filtering stage. The key idea behind the filtering
algorithm is the fact that instructional videos have distinguishable
text descriptions when compared with outliers. Hence, we use a
clustering algorithm to find the dominating cluster of instructional
videos free of outliers. Given a large video collection, we use the
graph explained in Section 5, compute the dominant video cluster by
using Single Cluster Graph Partitioning [43], and discard the remaining
videos as outliers. In Figure 7, we visualize some of the discarded
videos. Although our algorithm has a small percentage of false
positives while detecting outliers, we always have a sufficient number
of videos (minimum 50) after the outlier detection, thanks to the
large-scale dataset.



Figure 7: Sample videos which our algorithm discards as an 
outlier for various queries. A toy milkshake, a milkshake charm, 
a funny video about How to NOT make smoothie, a video about 
the danger of a fire, a cartoon video, a neck-tie video erroneously 
labeled as bow-tie, a song, and a lamb cooking mislabeled as 
chicken. 

6.2. Qualitative Results 

After independently running our algorithm on all categories, we
discover activity steps and parse the videos according to the
discovered steps. We visualize some of these categories qualitatively
in Figure 8 with the temporal parsing of evaluation videos as well as
the ground truth parsing.

To visualize the content of each activity step, we display key-frames
from different videos. We also train a 3rd-order Markov language model
[53] using the subtitles and employ it to generate a caption for each
activity step by sampling this model conditioned on $\theta_k^l$. We
explain the details of this process in [1].

As shown in Figures 8a and 8b, the resulting steps are semantically
meaningful; hence, we conclude that there is enough language context
within the subtitles in order to detect activities. However, some of
the activity steps occur together and our algorithm merges them into a
single step as a result of promoting sparsity.

6.3. Quantitative Results 

We compare our algorithm with the following baselines.
Low-level features (LLF): In order to assess the effect of the learned
atoms, we compare them with low-level features. As features, we use the
Fisher vector representation of Dense Trajectory like features (HOG,
HOF, and MBH) [22].
Single modality: To assess the effect of the multi-modal approach, we
compare with single modalities by only using the atoms of one modality.

Hidden Markov Model (HMM): To assess the effect of the joint generative
model, we compare our algorithm with an HMM (using the Baum-Welch
[4< ] via cross-validation).
Kernel Temporal Segmentation [48]: Kernel Temporal Segmentation (KTS),
proposed by Potapov et al. [48], can detect the temporal boundaries of
the events/activities in the video from time series data without any
supervision. It enforces a local similarity of each resultant segment.

(b) How to make a milkshake?

Figure 8: Temporal segmentation of the videos and ground truth
segmentation. We also color code the activity steps we discovered and
visualize their key-frames and the automatically generated captions.
Best viewed in color.

Figure 9: $IOU_{cms}$ values for all categories, for all competing
algorithms.

Figure 10: $AP_{cms}$ values for all categories, for all competing
algorithms.

Given parsing results and the ground truth, we evaluate both the
quality of temporal segmentation and the activity step discovery. We
base our evaluation on two widely used metrics: intersection over union
(IOU) and mean average precision (mAP). IOU measures the quality of
temporal segmentation and is defined as

$$IOU = \frac{1}{N} \sum_{i=1}^{N} \frac{\tau_i^* \cap \tau_i'}{\tau_i^* \cup \tau_i'}$$

where $N$ is the number of segments, $\tau_i^*$ is the ground truth
segment and $\tau_i'$ is the detected segment. mAP is defined per
activity step and can be computed based on a precision-recall curve
[21]. In order to adapt these metrics to the unsupervised setting, we
use the cluster similarity measure (csm) [37], which enables us to use
any metric in an unsupervised setting. It chooses a matching of ground
truth labels with predicted labels by searching over matchings and
choosing the one giving the highest score. Therefore, $mAP_{cms}$ and
$IOU_{cms}$ are our final metrics.
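The matching-based evaluation can be sketched on frame-wise labels as follows. This is an illustrative simplification, not the evaluation code: it averages the intersection over union per ground truth step rather than per segment, searches exhaustively over one-to-one matchings (feasible only for a handful of steps), and leaves surplus predicted clusters unmatched. All function names are hypothetical.

```python
from itertools import permutations


def mean_iou(gt, pred, mapping):
    """Average IOU over ground truth labels under a cluster-to-label mapping."""
    labels = sorted(set(gt))
    total = 0.0
    for g in labels:
        gt_frames = {t for t, lab in enumerate(gt) if lab == g}
        pred_frames = {t for t, lab in enumerate(pred) if mapping.get(lab) == g}
        # The union is never empty because g occurs in gt.
        total += len(gt_frames & pred_frames) / len(gt_frames | pred_frames)
    return total / len(labels)


def best_matching_iou(gt, pred):
    """Search all one-to-one matchings and keep the highest mean IOU."""
    gt_labels = sorted(set(gt))
    pred_labels = sorted(set(pred))
    # Pad with None so that extra predicted clusters can stay unmatched.
    targets = gt_labels + [None] * max(0, len(pred_labels) - len(gt_labels))
    best_score, best_map = -1.0, {}
    for perm in set(permutations(targets, len(pred_labels))):
        mapping = dict(zip(pred_labels, perm))
        score = mean_iou(gt, pred, mapping)
        if score > best_score:
            best_score, best_map = score, mapping
    return best_score, best_map
```

For larger numbers of steps, the exhaustive search could be replaced by an assignment solver; the exhaustive version is kept here because it mirrors the description of searching over matchings.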

Accuracy of the temporal parsing. We compute, and plot in Figure 9, the
$IOU_{cms}$ values for all competing algorithms and all categories. We
also average over the categories and summarize the results in Table 1.
As Figure 9 and Table 1 suggest, the proposed method consistently
outperforms the competing algorithms and its variations. One
interesting observation is the importance of both modalities, reflected
in the dramatic difference between the accuracy of our method and its
single-modality versions.

Figure 11: Qualitative results for parsing the ‘Travel San Francisco’
category.

Table 1: Average $IOU_{cms}$ and $mAP_{cms}$ over all categories.

              KTS [48]   KTS [48]   ...   w/o Lang   ...
$IOU_{cms}$   16.80   28.01   30.84   37.69   33.16   36.50   29.91   52.36
$mAP_{cms}$   n/a     n/a     9.35    32.30   11.33   30.50   19.50   44.09

Table 2: Semantic mean-average-precision $mAP_{sem}$.

$mAP_{sem}$   6.44   24.83   7.28   28.93   14.83   39.01

Moreover, the difference between our method and HMM is also
significant. We believe this is due to the ill-posed definition of
activities in HMM, since the granularity of the activity steps is
subjective. In contrast, our method starts with the well-defined
objective of finding the set of steps which generate the entire
collection. Hence, our algorithm does not suffer from the granularity
problem.

Coherency and accuracy of activity step discovery. Although
$IOU_{csm}$ successfully measures the accuracy of the
temporal segmentation, it cannot measure the quality of the
discovered activities. In other words, we also need to evaluate
the consistency of the activity steps detected over multiple
videos. For this, we use the unsupervised version of mean
average precision, $mAP_{csm}$. We plot the $mAP_{csm}$ values
per category in Figure 10 and their average over categories
in Table 1. As Figure 10 and Table 1 suggest,
our proposed method outperforms all competing algorithms.
One interesting observation is the significant difference
between semantic and low-level features. Hence,
our mid-level features play a key role in linking videos.
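
As an illustration of the mAP computation, which is defined per activity step from a precision-recall curve [21], the sketch below computes a non-interpolated average precision from a ranked list and averages it over steps. It is a simplified stand-in rather than the evaluation protocol used in the paper: it assumes each discovered step provides a confidence score per clip together with a binary relevance flag obtained after matching, and the names `average_precision` and `mean_average_precision` are hypothetical.

```python
def average_precision(scores, relevant):
    """Non-interpolated AP: mean precision at the rank of each relevant clip.

    scores[i] is the confidence for clip i, relevant[i] says whether clip i
    truly belongs to the activity step being evaluated.
    """
    # Rank clips from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precisions.append(hits / rank)
    # A step with no relevant clips contributes an AP of zero.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_step):
    """Mean of AP over activity steps; per_step holds (scores, relevant) pairs."""
    aps = [average_precision(scores, relevant) for scores, relevant in per_step]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    step_a = ([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
    step_b = ([0.5, 0.4], [False, True])
    print(mean_average_precision([step_a, step_b]))
```

Averaging per step means that a step discovered consistently across videos raises the score even when its temporal boundaries are imperfect, which is the property this metric is used for here.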

Semantics of activity steps. In order to evaluate the role of
semantics, we performed a subjective analysis. We concatenated
the activity step labels in the ground truth into a label
collection. Then, we asked non-expert users to choose a label
for each discovered activity for each algorithm. In other
words, we replaced the maximization step with subjective
labels. We designed our experiments in a way that each clip
received annotations from 5 different users. We randomized
the ordering of videos and algorithms during the subjective
evaluation. Using the labels provided by the subjects, we
compute the mean average precision ($mAP_{sem}$).

Both the $mAP_{csm}$ and $mAP_{sem}$ metrics suggest that
our method consistently outperforms the competing ones.
There is only one recipe in which our method is outperformed
by our baseline with no visual information. This is
mostly because of the specific nature of the recipe “How to
tie a tie?”. In such videos, the notion of an object is not useful
since all videos use a single object, the tie.

The importance of each modality. As shown in Figures 9
and 10, the performance drops consistently across all categories
when either of the modalities is ignored. Hence, the
joint usage is necessary. One interesting observation is
the fact that using only language information performed
slightly better than using only visual information. We believe
this is due to the lower intra-class variance in the language
modality (i.e., people use the same words for the same
activities). However, language lacks many details (it is less
complete) and is noisier than visual information. Hence, these
results validate the complementary nature of language and vision.

Generalization to generic structured videos. We test
the applicability of our method beyond How-To
videos by evaluating it on non-How-To categories. In Figure 11,
we visualize the results for the videos retrieved using
the query “Travel San Francisco”. The resulting clusters
follow semantically meaningful activities and landmarks
and show the applicability of our method beyond How-To
queries. It is interesting to note that Chinatown and Clement
St ended up in the same cluster; considering the fact that
Clement St is known for its Chinese food, this shows a
successful utilization of semantic connections.

7. Conclusions 

In this paper, we tried to capture the underlying structure
of human communication by jointly considering visual and
language cues. We experimentally validated that, given a
large video collection with subtitles, it is possible to discover
activities without any supervision over activities or
objects. Experimental evaluation also suggested that the available
noisy and incomplete information is powerful enough
to not only discover activities but also describe them. We
also think that the resulting discovered knowledge can be
effectively used in many domains, such as multimedia interfaces
and robot knowledge bases [52].